Abstract

A 1992 study by Gershon found discrepancies when comparing the theoretical Rasch Item 
Characteristic Curve with the average empirical curve for 1,304 vocabulary items administered 
to 7,711 students. When person-item mismatches were deleted (for any person-item interaction
where the ability of the person was much higher or much lower than the difficulty of the item), 
the difference between the theoretical and empirically produced curves was decreased. This 
paper describes a replication of that original study, using data obtained from an administration 
of the California Achievement Test to students in the Minneapolis Public Schools. When 
person-item mismatches were deleted, item calibrations improved regardless of the grade level. 
The results are discussed with an emphasis on the importance of selectively deleting data when 
the primary goal of the analysis is to obtain the most accurate item difficulty estimates possible. 
This research is of particular importance for testing organizations who use item banks and/or 
computerized adaptive testing. 
Analyzing Multiple Choice Tests with the Rasch Model:

Improving Item Calibrations by Deleting Person-Item Mismatches 

Ideally, tests should be constructed to test examinees at their ability level. Yet, in many 
testing situations, some of the items on a test are inappropriately easy or difficult for some 
examinees. This situation may arise when both the test items and the test population are very 
heterogeneous, resulting in very able examinees responding to very easy items and vice versa. 
It may also occur when the purpose of the test is to show mastery. In this case many examinees 
may be more able than items are difficult. 

A primary assumption of any assessment model is that only relevant variables are being 
measured. David Andrich (1989) pointed out that person ability can theoretically span an infinite 
range but that a given item should be expected to "operate consistently only in a specific range 
of the variable". Andrich explains, "The range within which a statement (item) may be 
expected to operate consistently can be constrained by considering the probability of a positive 
or negative response; if this is very high (say greater than 80%. . .) then the responses should 
be eliminated.... The responses at the extremes simply should not be expected to work 
accurately." In practice, we often remove misfitting items before final test analysis. For this 
paper, our suggestion is not that entire items or persons be removed from the analysis, but rather 
that the data be edited to eliminate extreme item-person mismatches. 

Previous research (Gershon, 1992) has shown that for vocabulary test items, when 
persons with ability far below that of a given item difficulty were included in the sample, 
guessing played a role in estimating the difficulty of the item. When the observed performance 
of each person-item interaction was compared to predicted performance for this sample,
discrepancies were noted in the observed item characteristic curve, particularly at the lower end
of the scale. When the data set was edited (person-item interactions were marked as missing
when the ability of the examinee was more than 2 logits higher or 1 logit lower than the
difficulty of the item), the discrepancy between the theoretical curve and the empirical
curve was minimized.

Other research has shown that when inappropriate samples of examinees are used for item 
calibration, the standard deviation of the item calibrations is constrained (Bergstrom, Gershon 
and Brown, 1993). This constriction results in a decrease in the range of item difficulty 
calibrations obtained, and ultimately in poor targeting of the items which were found to be 
particularly "easy" or "difficult" for the specific sample. 

This paper explores the effect of poorly targeted tests on item parameter estimation. We
examine the results of marking as missing those item-person interactions where the difficulty of the
item differs considerably from the ability of the person. We replicate the original Gershon
(1992) research, which included only vocabulary items, with a more general data set. To our 
knowledge, no additional research has been published or presented on systematically eliminating 
person-item mismatches from multiple choice data. 

Method 

Subjects 

Data are from the Minneapolis Public Schools 1993 administration of the California 
Achievement Test. The California Achievement Test was selected by the Minneapolis Public 
Schools over other nationally normed tests because it most closely matched the district 
curriculum (Brown, O’Gorman, Rogers and Aim, 1993). Four subtests of the Form E 1985
standard edition of the California Achievement Test were administered district wide in the spring
of 1993 to grades 1, 2, 3, 4, 6, 8 and 10: vocabulary, reading comprehension, math computation,
and math concepts and applications.

We independently analyzed each grade. The results of the analyses were very similar, 
and thus we report only the grade 1 results in this paper. 

Method 

The data were analyzed using BIGSTEPS (Wright and Linacre, 1992) under the following 
three conditions: 

In Condition 1 there was no restriction of data (persons or items). 

In Condition 2, we employed a recently added feature in BIGSTEPS, "CUTHI/LO", that
allows the data set to be edited according to the specifications of the user.¹ "CUTHI" was set
to 2 and "CUTLO" was set to 1. These parameters instructed the program to (a) estimate all
person ability and item difficulty parameters using PROX; (b) examine each person-item
interaction and mark as missing all person-item interactions where the person ability estimate
is more than 2 logits greater than the item difficulty estimate and all person-item interactions where the
person ability estimate is more than 1 logit lower than the item difficulty estimate; and (c) re-estimate item
difficulties and person abilities using PROX and UCON iterations on the restricted data set.
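The editing rule in step (b) can be sketched in a few lines of Python. This is a minimal illustration and not the BIGSTEPS implementation: the function name `edit_mismatches`, the use of `None` for a missing response, the inclusive treatment of the boundary values, and the toy abilities and difficulties are all assumptions made for the example. Only the cut points (2 logits above, 1 logit below) come from the text.

```python
from typing import List, Optional


def edit_mismatches(
    responses: List[List[Optional[int]]],
    abilities: List[float],
    difficulties: List[float],
    cut_hi: float = 2.0,
    cut_lo: float = 1.0,
) -> List[List[Optional[int]]]:
    """Return a copy of the response matrix with mismatches set to None.

    A response is kept only when -cut_lo <= B - D <= cut_hi, where B is the
    person ability and D is the item difficulty, both in logits.
    Keeping responses that sit exactly on a cut point is an assumption.
    """
    edited = []
    for person_row, b in zip(responses, abilities):
        new_row = []
        for x, d in zip(person_row, difficulties):
            gap = b - d
            keep = -cut_lo <= gap <= cut_hi
            new_row.append(x if keep else None)
        edited.append(new_row)
    return edited


# Hypothetical toy data: 2 persons by 3 items (1 = correct, 0 = incorrect).
responses = [[1, 1, 0], [1, 0, 0]]
abilities = [1.5, -0.5]          # assumed preliminary ability estimates
difficulties = [-1.0, 0.0, 2.0]  # assumed preliminary difficulty estimates

edited = edit_mismatches(responses, abilities, difficulties)
print(edited)
```

In the toy data, the first person is 2.5 logits above the first item and the second person is 2.5 logits below the third item, so those two responses are the ones marked as missing.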

Since most users would not accept person ability measures based on a restricted data set,
we did an additional BIGSTEPS run in Condition 2 in which we anchored all item calibrations
to the values obtained in the CUTHI/LO analysis. This run produced person ability measures
based on the CUTHI/LO calibrations but included all item-person interactions.

¹ This feature was added subsequent to the research completed in conjunction with the
original Gershon (1992) study.

In Condition 3, the item and person calibrations obtained in the Condition 1 analysis were 
used to edit the original data matrix. Using the CAT Software System (Gershon, 1992) we 
marked as missing all person-item interactions outside the range -1 < B-D < 2 (where B is the ability of the person and D is the difficulty
of the item). The edited data set was then analyzed with BIGSTEPS with no additional persons 
or items deleted. 

The difference between Condition 2 and Condition 3 is the point at which the data was 
edited. In Condition 2, the editing occurs after the PROX iterations. In Condition 3, data is 
edited after UCON estimates are obtained. 

The output files generated from the BIGSTEPS runs were then graphed using the Item 
Characteristic Curve option in the CAT Software System (Gershon, 1992). The software 
program examines each person-item interaction using the person and item files generated by 
BIGSTEPS. In addition, it compares the answer key with the raw data file. For every item, 
the item difficulty is subtracted from the person ability. A tally is then kept for each quarter 
logit range on the B-D scale of the percent of items answered correctly. Theoretical, observed 
unedited, and observed edited results are plotted on the same scale. This plot compares the 
theoretical probability of correct response with the observed percent of correct response under 
each condition. 
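The quarter-logit tally described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the CAT Software System code: the function names, the dictionary layout of the result, the choice to evaluate the theoretical curve at each bin midpoint, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def rasch_probability(gap: float) -> float:
    """Rasch probability of a correct response for B - D = gap (logits)."""
    return 1.0 / (1.0 + math.exp(-gap))


def empirical_icc(
    responses: List[List[Optional[int]]],
    abilities: List[float],
    difficulties: List[float],
    width: float = 0.25,
) -> Dict[float, Tuple[float, float, int]]:
    """Tally the proportion correct in each bin of B - D.

    Returns {bin lower edge: (observed proportion correct,
                              theoretical probability at bin midpoint,
                              number of responses in the bin)}.
    Responses coded None (missing) are skipped.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, b in zip(responses, abilities):
        for x, d in zip(row, difficulties):
            if x is None:
                continue
            bin_index = math.floor((b - d) / width)
            correct[bin_index] += x
            total[bin_index] += 1
    result = {}
    for k in sorted(total):
        lower = k * width
        midpoint = lower + width / 2
        result[lower] = (correct[k] / total[k], rasch_probability(midpoint), total[k])
    return result


# Hypothetical toy data: 3 persons by 2 items.
responses = [[1, 0], [1, 1], [0, None]]
abilities = [0.1, 0.2, -1.0]
difficulties = [0.0, 0.05]

icc = empirical_icc(responses, abilities, difficulties)
for lower, (observed, theoretical, n) in icc.items():
    print(f"{lower:+.2f}  observed={observed:.3f}  theoretical={theoretical:.3f}  n={n}")
```

Running the same tally on the unedited and the edited response matrices gives the two observed curves that are plotted against the theoretical one.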
Results

Figure 1 shows the BIGSTEPS (Wright and Linacre, 1992) Map of Persons and Items 
from the Condition 1 analysis. The map indicates that many examinees have greater than 90% 
probability of answering many of the items correctly. It further shows that this data set is 
appropriate for considering the CUTHI/LO option because the distance between person ability 
and item difficulty is extreme for many cases. 

In Figure 2, the Condition 2 and Condition 3 item characteristic curves are shown. 
Figure 2 indicates that Condition 2 and Condition 3 produced identical results. This means that 
the item calibrations and person measures obtained from the BIGSTEPS CUTHI/LO analysis 
where the data was edited after the PROX iterations produced the same results as editing the data 
after UCON iterations. Since the results were effectively the same, and since using the 
CUTHI/LO option is by far the easier procedure, only the results for Condition 1 and Condition 
2 will be further reported. 

Impact of Deleting Item-Person Mismatches on the Item Characteristic Curve 

Figure 3 and Table 1 show the comparison of the theoretical item characteristic curve 
(ICC) versus the unedited ICC and the ICC obtained from the CUTHI/LO analysis. 

At the lower end of the scale where the difficulty of the item far exceeds the ability of 
the person, the Rasch model predicts less than 10% probability that a person will answer the 
item correctly. However, since these were multiple choice items with four distractors, one 
would presume a random probability of guessing an answer correctly closer to 25% of the time.
A second discrepancy is observed where there is no difference between the ability of the person 
and the difficulty of the item. At this point, the Rasch model predicts that items are answered 
correctly 50% of the time (Wright and Stone, 1979). The proportion of examinees responding
correctly in the unedited analysis was .448. When CUTHI/LO is employed, and only person-item
interactions where -1 < B-D < 2 are included, the proportion of examinees responding
correctly was .479.
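For reference, the theoretical values quoted in this section follow from the dichotomous Rasch model, in which the probability of a correct response depends only on B-D. The expression below is the standard form of the model; the symbols P and X are introduced here for illustration and do not appear in the original text.

```latex
P(X = 1 \mid B, D) = \frac{e^{B - D}}{1 + e^{B - D}}

% No difference between ability and difficulty:
B - D = 0 \;\Rightarrow\; P = \tfrac{1}{2}

% Predicted probability below 10 percent:
P < 0.10 \;\Longleftrightarrow\; B - D < -\ln 9 \approx -2.20

% Probabilities at the edges of the editing window -1 < B - D < 2:
B - D = -1 \;\Rightarrow\; P = \frac{1}{1 + e} \approx 0.27,
\qquad
B - D = 2 \;\Rightarrow\; P = \frac{e^{2}}{1 + e^{2}} \approx 0.88
```

Under this reading, the lower cut of 1 logit removes responses whose predicted probability of success falls below roughly .27, which is close to the 25% chance level mentioned above.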

Impact of CUTHI/LO on estimated item difficulty 

The impact of using CUTHI/LO on the estimated difficulty of items is seen in Figure 4. 
The correlation for the two sets of item calibrations is .99, but note that the standard deviations 
differ greatly. The standard deviation for the CUTHI/LO calibrations is 1.24 while the standard 
deviation for the unedited calibrations is 1.06. Figure 4 shows the difference between the 
identity line and the slope created by the ratio of the standard deviations (1.17). This difference 
is significant using an F test and comparing the ratio of the variances of the two sets of 
calibrations (F(113, 113) = 1.37, p < .05). Deleting inappropriate person-item interactions spreads
the item difficulties when administered to an appropriate sample of examinees; the easy items
are easier and the difficult items are more difficult.
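The reported slope and F statistic follow directly from the two standard deviations. The short check below reproduces the arithmetic only; the variable names are hypothetical, and no p-value is computed.

```python
sd_cut = 1.24       # SD of the CUTHI/LO item calibrations (from the text)
sd_unedited = 1.06  # SD of the unedited item calibrations (from the text)

slope = sd_cut / sd_unedited             # ratio of standard deviations
f_ratio = sd_cut ** 2 / sd_unedited ** 2  # ratio of variances

print(f"slope = {slope:.2f}")   # prints "slope = 1.17"
print(f"F     = {f_ratio:.2f}")  # prints "F     = 1.37"
```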

Impact of CUTHI/LO on estimated person ability

Figure 5 shows a comparison of the ICCs obtained from the Condition 1, unedited 
analysis, the Condition 2, CUTHI/LO analysis and the analysis when all items were anchored 
to the CUTHI/LO item calibrations. The latter analysis anchored the item difficulty estimates
obtained from CUTHI/LO, but all person-item encounters were included regardless of B-D. 
This analysis is comparable to what may typically be done in practice. After the "best" estimate 
for the item calibrations is obtained, item values are anchored and all persons and items are 
included in a subsequent analysis. 
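The anchored run can be illustrated with a small sketch in which item difficulties are held fixed and only the person ability is estimated. This is a generic Newton-Raphson maximum likelihood routine for the Rasch model, offered as an assumption about what anchoring amounts to, not as the BIGSTEPS procedure. The function name, starting value, convergence settings, and toy data are hypothetical.

```python
import math
from typing import List, Optional


def estimate_ability(
    responses: List[Optional[int]],
    anchored_difficulties: List[float],
    max_iter: int = 50,
    tol: float = 1e-6,
) -> float:
    """Maximum likelihood ability estimate with item difficulties held fixed.

    Missing responses (None) are ignored. Zero and perfect raw scores have
    no finite estimate under this method, so they raise ValueError.
    """
    items = [(x, d) for x, d in zip(responses, anchored_difficulties) if x is not None]
    n = len(items)
    raw_score = sum(x for x, _ in items)
    if raw_score == 0 or raw_score == n:
        raise ValueError("no finite estimate for a zero or perfect score")

    # Start from the log-odds of the raw score, shifted by the mean difficulty.
    b = math.log(raw_score / (n - raw_score)) + sum(d for _, d in items) / n
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(b - d))) for _, d in items]
        expected = sum(probs)                        # expected raw score at b
        information = sum(p * (1.0 - p) for p in probs)
        step = (raw_score - expected) / information  # Newton-Raphson update
        b += step
        if abs(step) < tol:
            break
    return b


# Hypothetical anchored difficulties and one person's full response string.
anchored = [-1.0, -0.5, 0.5, 1.0]
ability_mid = estimate_ability([1, 1, 0, 0], anchored)
ability_high = estimate_ability([1, 1, 1, 0], anchored)
print(round(ability_mid, 3), round(ability_high, 3))
```

Because every response is used, including those far from the person's ability, the resulting measures remain exposed to the mistargeted interactions even though the item values themselves came from the edited data.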







When the CUTHI/LO item calibrations were used but all person-item interactions were 
included in the table and graph, the percentage of examinees responding correctly was .458. 
This finding illustrates the reality of using the CUTHI/LO option when the goal of the analysis 
is to obtain person ability estimates. The estimates obtained using anchored item difficulty 
estimates are better than using an unedited data set. However, they are still subject to a lot of 
noise presumably caused by person misfit due to poorly targeted items. 

Impact of CUTHI/LO on item fit 

A comparison of the item mean squared infit statistic from the unedited analysis and the 
CUTHI/LO analysis (Figure 6) shows that fewer items misfit when CUTHI/LO is employed.
Although there is no difference in the mean of the two statistics, a comparison of the log of the 
variances of the two analyses is once again significant (F(113, 113) = 1.89, p < .05). More items in
the unedited analysis fit poorly and more of the items were likely to overfit. 
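The text does not define the infit statistic, so the sketch below uses the standard information-weighted mean square for a dichotomous Rasch item: the sum of squared residuals divided by the sum of the model variances. The function name and toy data are hypothetical. Values near 1 indicate fit, and values well below 1 indicate overfit.

```python
import math
from typing import List, Optional


def infit_mean_square(
    item_responses: List[Optional[int]],
    abilities: List[float],
    difficulty: float,
) -> float:
    """Information-weighted (infit) mean square for one dichotomous item."""
    squared_residuals = 0.0
    model_variance = 0.0
    for x, b in zip(item_responses, abilities):
        if x is None:  # skip responses marked as missing
            continue
        p = 1.0 / (1.0 + math.exp(-(b - difficulty)))
        squared_residuals += (x - p) ** 2
        model_variance += p * (1.0 - p)
    return squared_residuals / model_variance


# Persons exactly on target, half succeed: the statistic equals 1.
on_target = infit_mean_square([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)

# A perfectly ordered (deterministic) pattern: the statistic falls below 1.
ordered = infit_mean_square([0, 0, 1, 1], [-2.0, -1.0, 1.0, 2.0], 0.0)

print(round(on_target, 3), round(ordered, 3))
```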

Discussion 

Most "traditional" paper and pencil based testing formats will produce maps similar to 
the one shown in Figure 1, as tests are often designed to allow persons to answer the vast 
majority of the items correctly. While this practice does little to improve the overall efficiency 
of the testing experience, there is much to be said for allowing examinees a primarily successful 
testing experience. Also, in the case of the Minneapolis Public Schools, the student population
included Chapter 1, Special Education students and Limited English Proficient students. The 
range of estimated ability at Grade 1 was over 8 logits. Since no single test will be sufficient 
to examine such a heterogeneous group, the test is targeted to lower ability students. The impact 
of this type of test is to produce a data set in which many of the person-item interactions are 
clearly inappropriate for accurate item difficulty parameter estimation.

Editing data sets is an integral part of good data analysis. Item estimation is improved 
when misfitting persons are removed from the data set and person estimation is improved when 
misfitting items are removed from the analysis. A basic tenet of the Rasch model is that good 
quality items have constant values for other so-called parameters such as guessing and
discrimination. This research confirms this to be true especially when appropriately targeted 
persons are included in the sample. Using CUTHI/LO allows for improved item estimation 
based on samples of examinees for whom the item is appropriately targeted even though the 
original (unedited) data might not otherwise be considered appropriate for this purpose. 

The ability of the CUTHI/LO procedure to take a mistargeted data set and
make it a good one has several advantages for test developers. Extreme items, which have
otherwise been overly subject to either (a) the effects of guessing by low-ability examinees, or (b)
the effects of sleeping by high-ability examinees who sometimes get easy items wrong, can now
be safely included in an analysis. Large scale banking efforts are frequently stymied by poor 
pre-targeting of test items to appropriate ability samples. CUTHI/LO can be used to remove 
the effects of this poor targeting. Of course, this option cannot guarantee that enough data will 
be left to accurately estimate the item difficulty! But, the results of this analysis help us to 
understand that the "data" never really existed in the first place. On the other hand, the use of 
the remaining well-targeted data associated with the CUTHI/LO option may lead many
psychometricians to re-examine thresholds for the amount of data needed for accurate parameter
estimation.

The significant increase in the standard deviation of the items and the significant decrease
in the standard deviation of the mean square fit statistic indicate that the CUTHI/LO procedure
is producing improved item calibration estimates, leading to improved person ability estimates. 
While the "improvement" is relatively small for any single test, the effect is likely to be greatly 
magnified whenever equating is used. CUTHI/LO helps to release the constraints on the item 
difficulty range which are encountered when inappropriate person-item interactions are used to 
estimate item difficulty.

Editing person-item mismatches will be especially useful with the advent of computer 
adaptive testing. Given that pre-calibrated item banks (often calibrated with paper and pencil 
tests) are being used for adaptive testing, accurate item difficulty estimates are essential. 
Furthermore, since adaptive testing can take advantage of items across the person ability range, 
improved parameter estimation for "easy" and "hard" items becomes increasingly important. 
Figure 2. Comparison of Data Editing Procedures

Figure 3. Grade 1 Item Characteristic Curves

Figure 4. Unedited Calibrations vs CUTHI/LO Calibrations

Figure 5. The Impact of Anchored Item Calibrations

Figure 6